(57) A computationally efficient multiplication method and apparatus for modular exponentiation. The apparatus uses a preload register, coupled to a multiplier at a second input port via a KN bit bus, to load the value of the "a" multiplicand in the multiplier in a single clock pulse. The "b" multiplicand (which is also KN bits long) is supplied to the multiplier N bits at a time from a memory output port via an N bit bus coupled to a multiplier first input port. The multiplier multiplies the N bits of the "b" multiplicand by the KN bits of the "a" multiplicand and provides that product at a multiplier output N bits at a time, where it can be supplied to the memory via a memory input port.
Description

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application is related to co-pending and commonly assigned application serial number 08/828,368, entitled "High-Speed Modular Exponentiator," by Gregory A. Powell, Mark W. Wilson, Kevin Q. Truong, and Christopher P. Curren, filed March 28, 1997, which application is hereby incorporated by reference herein.
[0002] This application is also related to co-pending and commonly assigned application serial number __/___,___, entitled "High Speed Montgomery Value Calculation," by Matthew S. McGregor, filed on same date herewith, which application is also hereby incorporated by reference herein.

BACKGROUND OF THE INVENTION

1. Field of the Invention

[0003] The present invention relates to cryptographic systems, and more particularly, to a highly efficient multiplier for performing modular reduction operations integral to cryptographic key calculations.

2. Description of Related Art

[0004] Cryptographic systems are commonly used to restrict unauthorized access to messages communicated over otherwise insecure channels. In general, cryptographic systems use a unique key, such as a series of numbers, to control an algorithm used to encrypt a message before it is transmitted over an insecure communication channel to a receiver. The receiver must have access to the same key in order to decode the encrypted message. Thus, it is essential that the key be communicated in advance by the sender to the receiver over a secure channel in order to maintain the security of the cryptographic system; however, secure communication of the key is hampered by the unavailability and expense of secure communication channels. Moreover, the spontaneity of most business communications is impeded by the need to communicate the key in advance.

[0005] In view of the difficulty and inconvenience of communicating the key over a secure channel, so-called public key cryptographic systems are proposed in which a key may be communicated over an insecure channel without jeopardizing the security of the system. A public key cryptographic system utilizes a pair of keys in which one is publicly communicated, i.e., the public key, and the other is kept secret by the receiver, i.e., the private key. While the private key is mathematically related to the public key, it is practically impossible to derive the private key from the public key alone. In this way, the public key is used to encrypt a message, and the private key is used to decrypt the message.
[0006] Such cryptographic systems often require computation of modular exponentiations of the form $y = b^e \bmod n$, in which the base b, exponent e and modulus n are extremely large numbers, e.g., having a length of 1,024 binary digits or bits. If, for example, the exponent e were transmitted as a public key, and the base b and modulus n were known to the receiver in advance, a private key y could be derived by computing the modular exponentiation. It would require such an extremely large amount of computing power and time to factor the private key y from the exponent e without knowledge of the base b and modulus n, that unauthorized access to the decrypted message is virtually precluded as a practical matter.

[0007] A drawback of such cryptographic systems is that calculation of the modular exponentiation remains a daunting mathematical task even to an authorized receiver using a high speed computer. With the prevalence of public computer networks used to transmit confidential data for personal, business and governmental purposes, it is anticipated that most computer users will want cryptographic systems to control access to their data. Despite the increased security, the difficulty of the modular exponentiation calculation will substantially drain computer resources and degrade data throughput rates, and thus represents a major impediment to the widespread adoption of commercial cryptographic systems.

[0008] Accordingly, a critical need exists for a high speed modular exponentiation method and apparatus to provide a sufficient level of communication security while minimizing the impact to computer system performance and data throughput rates.

SUMMARY OF THE INVENTION 

[0009] In accordance with the teachings of the present invention, a highly efficient method and apparatus is disclosed for performing operations required for modular exponentiation. The apparatus is especially well suited for implementing multiplications using the Montgomery algorithm.

[0010] The efficient multiplier architecture uses a preload register, coupled to a multiplier at a second input port via a KN bit bus, to load the value of the "a" multiplicand in the multiplier in a single clock pulse. The "b" multiplicand (which is also KN bits long) is supplied to the multiplier N bits at a time from a memory via an N bit bus coupled to a multiplier. The multiplier multiplies the N bits of the "b" multiplicand by the KN bits of the "a" multiplicand and provides that product at a multiplier output N bits at a time, where it can be supplied to the memory.

[0011] The efficient multiplication method using the foregoing architecture is also described. The method begins by providing KN bits of the multiplicand "a" from a preload register to a second multiplier input port in a single clock pulse. Then, N bits of the multiplicand "b" are provided to a first multiplier input port, also in a single clock pulse. The KN bits of the number "a" are multiplied by the N bits of the number "b", and this is repeated until all of the KN bits of the "b" multiplicand have been provided to the first multiplier input port and multiplied by the KN bits of the "a" multiplicand. When completed, these operations result in an output number, which is then transmitted to the memory, where it can be made available for further processing.

[0012] In accordance with the deterministic behavior of the Montgomery algorithm, one embodiment of the present invention loads a predicted (future) value for multiplicand "a" into the preload register while multiplication operations on the current "a" and "b" multiplicands are being performed. This technique further reduces the clock cycles necessary to load and multiply the parameters.
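
The data flow summarized above can be illustrated with a short software model. This is only a sketch under stated assumptions: the function name `chunked_multiply` and its parameters are hypothetical, Python integers stand in for the KN bit preload register, and the loop models one N bit word of "b" arriving per step rather than any actual clocking of the hardware.

```python
def chunked_multiply(a: int, b: int, n_bits: int, k: int) -> int:
    """Multiply two K*N-bit numbers while consuming b only N bits at a time.

    Hypothetical model: 'preload' plays the role of the preload register that
    presents all K*N bits of a at once; b is read word by word, as it would
    be over an N-bit memory bus.
    """
    limit = 1 << (k * n_bits)
    assert 0 <= a < limit and 0 <= b < limit, "operands must fit in K*N bits"

    word_mask = (1 << n_bits) - 1
    preload = a          # whole multiplicand a available in one step
    product = 0
    for i in range(k):   # one N-bit word of b per iteration
        b_word = (b >> (i * n_bits)) & word_mask
        # (K*N bits of a) x (N bits of b), placed at the weight of word i
        product += (preload * b_word) << (i * n_bits)
    return product


if __name__ == "__main__":
    a, b = 0x1234, 0xBEEF
    print(hex(chunked_multiply(a, b, n_bits=4, k=4)))
```

The model makes the asymmetry explicit: the "a" operand is loaded once, while the "b" operand is streamed, so only the narrow bus is exercised inside the loop.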

[0013] A more complete understanding of the computationally efficient multiplier will be afforded to those skilled in the art, as well as a realization of additional advantages and objects thereof, by a consideration of the following detailed description of the preferred embodiment. Reference will be made to the appended sheets of drawings which will first be described briefly.

BRIEF DESCRIPTION OF THE DRAWINGS 
[0014]

FIG. 1 is a block diagram of an exemplary application of a modular exponentiator within a cryptographic system;
FIG. 2 is a block diagram of the modular exponentiator;
FIG. 3 is a system level flow diagram of the functions performed by the modular exponentiator;
FIG. 4 is a flow chart showing an exponent bit scanning operation performed by the modular exponentiator;
FIG. 5a-c are block diagrams of an exponent register within various stages of the exponent bit scanning operation of FIG. 4;
FIG. 6 is a flow chart showing a multiplication operation performed by the modular exponentiator;
FIG. 7 is a flow chart showing a squaring operation performed in conjunction with the multiplication operation of FIG. 6;
FIG. 8 is a chart showing an exemplary exponent bit scanning operation in accordance with the flow chart of FIG. 4;
FIG. 9 is a chart showing an exemplary multiplication and squaring operation in accordance with the flow charts of FIGs. 6 and 7;
FIG. 10 is a block diagram showing a system architecture which can be employed to practice the present invention;
FIG. 11 is a block diagram showing one embodiment of the multiplier and associated modules;
FIG. 12 is a timing diagram showing the pre-loading of predictive multiplicands; and
FIGs. 13 and 14 are flow charts depicting the multiplication operations.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

[0015] The present invention satisfies the need for a high speed modular exponentiation method and apparatus which provides a sufficient level of communication security while minimizing the impact to computer system performance and data throughput rates. In the detailed description that follows, like element numerals are used to describe like elements in one or more of the figures.

[0016] Referring first to FIG. 1, a block diagram of an application of a modular exponentiator 20 within an exemplary cryptographic system 10 is illustrated. The exemplary cryptographic system 10 includes a central processing unit (CPU) 12, a random access memory (RAM) 14, a read only memory (ROM) 16, and modular exponentiator 20. Each of the elements of the cryptographic system 10 are coupled together by a bi-directional data and control bus 18, over which data and control messages are transmitted. The CPU 12 controls the operation of the cryptographic system 10, and may be provided by a conventional microprocessor or digital signal processor circuit. The RAM 14 provides temporary data storage for operation of the CPU 12, and the ROM 16 provides for non-volatile storage of an instruction set, i.e., software, that is executed in a sequential manner by the CPU 12 to control the overall operation of the cryptographic system 10. The modular exponentiator 20 may comprise a special function device, such as an application specific integrated circuit (ASIC) or field programmable gate array (FPGA), that is accessed by the CPU 12 to perform modular exponentiation operations. Alternatively, the elements of the cryptographic system 10 may all be contained within a single ASIC or FPGA in which the modular exponentiator 20 is provided as an embedded core process.
[0017] As known in the art, the cryptographic system provides an interface between a non-secure communication channel and a data user. The cryptographic system receives encrypted data from an external source, such as a remote transmitter (not shown) which is communicating with the cryptographic system over the communication channel. The encrypted data is decrypted by the cryptographic system, and the decrypted data is provided to the data user. Conversely, the data user provides decrypted data to the cryptographic system for encryption and subsequent transmission across the communication channel. The cryptographic system also receives and transmits various non-encrypted messages, such as control data and the public key information. It should be apparent that all communications with the cryptographic system occur via the data and control bus 18.

[0018] The modular exponentiator 20 is illustrated in greater detail in FIG. 2. The modular exponentiator 20 comprises an interface logic unit 22, a pair of parallel processing units 24a, 24b, and a RAM 25, which all communicate internally over a data and control bus 27. The interface logic unit 22 controls communications between the modular exponentiator 20 and the data and control bus 18 of the cryptographic system 10 described above. The processing units 24a, 24b comprise respective control units 26a, 26b and multiplier units 28a, 28b, which further comprise internal circuit elements that execute a modular exponentiation process, as will be further described below. The RAM 25 provides for temporary storage of data values generated by the control units 26a, 26b and multiplier units 28a, 28b while executing a modular exponentiation operation.

[0019] Referring now to FIG. 3 in conjunction with FIG. 2 described above, a system level flow diagram of the functions performed by the modular exponentiator 20 is illustrated. As shown at step 101, the modular exponentiator 20 will compute a modular exponentiation of the form $y = b^e \bmod n$, in which the modulus n, base b and exponent e are each k bits long. In a preferred embodiment of the present invention, k is 1,024 bits. Using conventional methods, solving such a modular exponentiation would require a tremendous amount of computing power due to the large number and size of the multiplications and modular reductions that must be performed. In the present invention, the modular exponentiation is solved in a highly efficient manner by reducing the size of the problem and by reducing the number of multiplications that are performed.

[0020] As a first step in solving the modular exponentiation, the original exponentiation is split into components, as follows:

$$ b^e \bmod n = \Bigl(\bigl((q^{-1} \bmod p) \cdot (b_r^{e_p} \bmod p + p - b_r^{e_q} \bmod q)\bigr) \bmod p\Bigr) \cdot q + b_r^{e_q} \bmod q $$

in which p and q are large prime numbers whereby n = p*q. For maximum security, p and q should be roughly the same size. The term $q^{-1} \bmod p$ is a special value called an inverse which is derived from the Chinese remainder theorem, as known in the art. In particular, $q^{-1} \bmod p$ is the inverse of q mod p. Since the inverse represents a modular exponentiation of the same order as $b_r^{e_p} \bmod p$, the inverse may be pre-calculated in advance, and stored in the RAM 25 at step 108. The values $e_p$ and $e_q$ are k/2 bit values equal to e mod (p-1) and e mod (q-1), respectively. A reduced base term $b_r$ for each of $b_r^{e_p} \bmod p$ and $b_r^{e_q} \bmod q$ is provided by taking a modular reduction of b with respect to p and q, respectively. The reduced base terms $b_r$ thus have a k/2 bit length as well.

[0021] Splitting the modular exponentiation permits its solution in two parallel paths, as illustrated in FIG. 3, which are processed separately by the respective processing units 24a, 24b of FIG. 2. At steps 104, 105, the modular exponentiations $b_r^{e_p} \bmod p$ and $b_r^{e_q} \bmod q$ are calculated separately using techniques that will be further described below. The $b_r$ terms of each of the two modular exponentiations may be pre-calculated in advance, and stored in the RAM 25 at steps 102, 103.
[0022] Since p and q are each respectively k/2 bits in length, the magnitude of the respective problems is thus reduced substantially from its original form. Moreover, the parallel calculation of two reduced-size modular exponentiations requires substantially less computer processing time than a corresponding calculation of the original modular exponentiation within a single processing unit. The reduction in processing time results from the fact that the number of multiplies needed to perform an exponentiation with an efficient algorithm (such as described below) is proportional to $2s^2 + s$, where s is equal to k divided by the multiplication operand size in bits. If an s word problem was treated as two separate s/2 word problems, the number of multiply operations per exponentiation is reduced to a value proportional to

$$ 2\left(\frac{s}{2}\right)^2 + \frac{s}{2} $$

For example, if k were 1,024 bits and the multiplication operand were 128 bits, s would be equal to 8. Accordingly, an s word problem would require a number of multiply operations proportional to 136, while the two separate s/2 word problems would respectively require a number of multiply operations proportional to 36. Thus, the number of multiply operations is reduced by 3.778 times.
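
The arithmetic in this example can be checked in a few lines. The helper name `multiply_count` is an assumption introduced here for illustration; it simply evaluates the proportionality expression $2s^2 + s$ given in the text.

```python
def multiply_count(s: int) -> int:
    """Value proportional to the number of multiplies for an s-word problem."""
    return 2 * s * s + s


k_bits = 1024          # length of modulus, base and exponent
operand_bits = 128     # multiplication operand size
s = k_bits // operand_bits

full_problem = multiply_count(s)        # one s-word exponentiation
half_problem = multiply_count(s // 2)   # each of the two s/2-word exponentiations
reduction = full_problem / half_problem

print(s, full_problem, half_problem, round(reduction, 3))
```

Because the two half-size problems run on separate processing units, the ratio compares the full problem against a single half-size problem, which is how the factor of 3.778 arises.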
[0023] Following the calculations of steps 104, 105, the $b_r^{e_q} \bmod q$ term is subtracted from $b_r^{e_p} \bmod p$ and the result is added to p at step 106. At step 107, the resulting sum is multiplied by the inverse $q^{-1} \bmod p$ which was pre-calculated at step 108. This step may be performed by one of the multipliers 28a, 28b, which are optimized for modular operations as will be further described below. The resulting product is modularly reduced with respect to p at step 109, and further multiplied by q at step 110 to produce a k-bit value. Lastly, the product of that final multiplication is added to $b_r^{e_q} \bmod q$ at step 111, which was previously calculated at step 105. It should be appreciated that the modular reduction that occurs at step 109 is much easier than the original modular exponentiation in view of the substantial reduction in size of the original $b^e$ term. This final solution to the modular exponentiation is provided to the data and control bus 18 for further use by the CPU 12.
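
The split-and-recombine flow of steps 102 through 111 can be sketched in software as follows. This is a minimal illustration, not the hardware flow itself: the function name `crt_modexp` is hypothetical, Python's built-in `pow` stands in for the two processing units, the inverse is computed on the fly rather than pre-calculated, and the sketch assumes that b is not divisible by p or q so that reducing the exponent modulo p-1 and q-1 is valid.

```python
def crt_modexp(b: int, e: int, p: int, q: int) -> int:
    """Compute b**e mod (p*q) from two half-size exponentiations.

    Assumes p and q are distinct primes and that b is divisible by neither.
    Requires Python 3.8+ for pow(q, -1, p).
    """
    # Reduced exponents and reduced bases (steps 102, 103)
    e_p, e_q = e % (p - 1), e % (q - 1)
    b_p, b_q = b % p, b % q

    # Two independent half-size exponentiations (steps 104, 105)
    x_p = pow(b_p, e_p, p)
    x_q = pow(b_q, e_q, q)

    # Inverse of q modulo p (step 108; pre-calculated in the text)
    q_inv = pow(q, -1, p)

    # Subtract, add p to stay non-negative, multiply by the inverse (106, 107)
    h = (q_inv * (x_p + p - x_q)) % p   # reduction modulo p (step 109)

    # Multiply by q and add the mod-q result (steps 110, 111)
    return h * q + x_q


if __name__ == "__main__":
    p, q = 61, 53
    print(crt_modexp(65, 17, p, q), pow(65, 17, p * q))
```

Note that the only reduction after the two exponentiations is modulo p, on a value of roughly half the original width, which matches the remark that step 109 is much cheaper than the original problem.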

[0024] Referring now to FIGs. 4 and 5a-c, the modular exponentiations of $b_r^{e_p} \bmod p$ and $b_r^{e_q} \bmod q$ from steps 104, 105 of FIG. 3 are shown in greater detail. Specifically, FIG. 4 illustrates a flow chart describing a routine referred to herein as exponent bit-scanning, which is used to reduce the number of multiplications necessary to perform an exponentiation. In general, the exponent bit-scanning routine factors the exponentials $b_r^{e_p}$ and $b_r^{e_q}$ into a product of precomputed powers of the reduced base $b_r$ modularly reduced with respect to p or q. The routine may be coded in firmware and executed sequentially by the respective processing units 24a, 24b described above in the form of a software program. Alternatively, the routine may be hardwired as discrete logic circuits that are optimized to perform the various functions of the exponent bit-scanning routine. For convenience, the description that follows will refer only to the operation of the exponent bit scanning routine with respect to the exponential $b_r^{e_p}$, but it should be appreciated that a similar operation must be performed with respect to the exponential $b_r^{e_q}$.

[0025] The exponent bit-scanning routine is called at step 200, and a running total is initialized to one at step 201. An exponent $e_p$ to be bit-scanned is loaded into a register at step 202. FIGs. 5a-c illustrate a k-bit exponent e (i.e., bits $e_{k-1}$ to $e_0$) loaded into a register 32. The register 32 may comprise a predefined memory space within the RAM 25. First, a window 34 is defined through which a limited number of bits of the exponent e are accessed. A window size of three bits is used in an exemplary embodiment of the present invention, though it should be appreciated that a different number could also be advantageously utilized. The window 34 is shifted from the left of the register 32 until a one appears in the most significant bit (MSB) of the 3-bit window, as shown by a loop defined at steps 203 and 204. In step 203, the MSB is checked for presence of a one, and if a one is not detected, the window 34 is shifted by one bit to the right at step 204. FIG. 5b illustrates the window 34 shifted one bit to the right. It should be apparent that steps 203 and 204 will be repeated until a one is detected.

[0026] At step 205, a one has been detected at the MSB, and the value of the three-bit binary number in the window 34 is read. The number is necessarily a 4, 5, 6 or 7 (i.e., binary 100, 101, 110 or 111, respectively) since the MSB is one. At step 206, a pre-computed value for the reduced base $b_r$ raised to the number read from the window 34 (i.e., $b_r^4$, $b_r^5$, $b_r^6$ or $b_r^7$, respectively) is fetched from memory. This pre-computed value is multiplied by a running total of the exponentiation at step 207. It should be appreciated that in the first pass through the routine the running total is set to one as a default.

[0027] Thereafter, a loop begins at step 209 in which the register 32 is checked to see if the least significant bit (LSB) of the exponent $e_p$ has entered the window 34. Significantly, step 209 checks for the LSB of the entire exponent $e_p$, in contrast with step 203 which reads the MSB of the window 34. If the LSB has not yet entered the window 34, the loop continues to step 212 at which the window 34 is successively shifted to the right and step 213 in which the running total is modular squared with each such shift. The loop is repeated three times until the previous three bits are no longer in the window 34, i.e., three shifts of the window. Once three shifts have occurred, the routine determines at step 216 whether the MSB is one. If so, the routine returns to step 205, and the value in the window 34 is read once again. Alternatively, if the MSB is zero, then the register 32 is again checked at step 217 to see if the LSB of the exponent $e_p$ has entered the window 34. If the LSB is not in the window 34, the loop including steps 212 and 213 is again repeated with the window again shifted one bit to the right and the running total modular squared with the shift.
[0028] If, at step 217, the LSB has entered the window 34, this indicates that the end of the exponent $e_p$ has been reached and the exponent bit-scanning routine is almost completed. At step 222, the last two bits in the window 34 are read, and at step 223 the running total is multiplied by the reduced base $b_r$ a number of times equal to the value read in the window. For example, if the value of the lower two bits is a one, two, or three (i.e., binary 01, 10 or 11, respectively), then the previous running total is multiplied by the reduced base $b_r$ one, two or three times, respectively. If the value of the lower two bits is a 0, then the running total is not changed (i.e., multiplied by one). Then, the exponent bit-scanning routine ends at step 224.

[0029] Returning to step 209 discussed above, before the loop begins, the register 32 is checked to see if the LSB of the exponent $e_p$ has entered the window 34. If the LSB has entered the window 34, a series of steps are performed in which the count value is checked. The count value keeps track of the number of passes through the above-described loop that have taken place. If the count value is three, indicating that all of the bits in the window 34 have been previously scanned, then the exponent bit-scanning routine ends at step 224. If the count value is two, then all but the last bit in the window 34 has been previously scanned, and at step 221, the value of the last bit is read. If the count value is one, then only the first bit in the window 34 has been previously scanned, and at step 222, the value of the last two bits is read (as already described above). Once again, at step 223 the running total is multiplied by the reduced base a number of times equal to the value read in the window. Then, the exponent bit-scanning routine ends at step 224.
[0030] An example of the exponent bit-scanning technique is illustrated in FIG. 8 with respect to a modular exponentiation of a base b raised to a ten-bit exponent e, in which e = 1011010011. The successive shifts reduce the exemplary term $b^{1011010011}$ to $\bigl((b^5)^{2^3} \cdot b^5\bigr)^{2^4} \cdot b^3$. Since the term $b^5$ was precalculated and fetched from memory, processing time is saved by not having to calculate that term. In addition, there are additional processing time savings that are achieved in performing a modular reduction of the exemplary term with respect to n due to the distributive nature of modular reduction. Rather than a huge number of multiplications followed by an equally huge modular reduction, only nine multiplications and modular reductions are required, and the modular reductions are smaller in magnitude since the intermediate values are smaller.
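
A windowed exponentiation in the spirit of the bit-scanning routine can be sketched as below. It is not a step-for-step transcription of FIG. 4: the function name `bit_scan_modexp` is hypothetical, there is no explicit count value, and the squarings for a window are applied just before its table multiply rather than during the shifts that follow, which yields the same result. As in the text, only powers whose window value has its MSB set are precomputed, and the short tail is handled by repeated multiplication by the reduced base.

```python
def bit_scan_modexp(base: int, exponent: int, modulus: int, window: int = 3) -> int:
    """Left-to-right windowed modular exponentiation (illustrative sketch)."""
    b_r = base % modulus
    # Precomputed powers for window values with the MSB set (4..7 for 3 bits)
    lowest = 1 << (window - 1)
    table = {v: pow(b_r, v, modulus) for v in range(lowest, 1 << window)}

    bits = bin(exponent)[2:]          # MSB first, leading zeros already skipped
    total = 1 % modulus               # running total starts at one
    i = 0
    while i < len(bits):
        remaining = len(bits) - i
        if remaining < window:
            # Tail: fewer bits left than the window holds
            tail = int(bits[i:], 2)
            for _ in range(remaining):
                total = total * total % modulus
            for _ in range(tail):     # multiply by b_r 'tail' times
                total = total * b_r % modulus
            break
        if bits[i] == "1":
            # Window MSB is one: consume a full window with one table multiply
            value = int(bits[i:i + window], 2)
            for _ in range(window):
                total = total * total % modulus
            total = total * table[value] % modulus
            i += window
        else:
            # Window MSB is zero: shift one bit, squaring once
            total = total * total % modulus
            i += 1
    return total


if __name__ == "__main__":
    e = 0b1011010011                  # the ten-bit exponent from the example
    print(bit_scan_modexp(7, e, 1_000_003) == pow(7, e, 1_000_003))
```

For the exponent 1011010011 the sketch consumes the windows 101 and 101, then two single zero bits, then the tail 11, reproducing the factorization $((b^5)^{2^3} \cdot b^5)^{2^4} \cdot b^3$.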

[0031] It should be appreciated that the modular squaring step that occurs with each shift is necessary since the exponent bit-scanning begins at the MSB of the exponent $e_p$, where the window value is not really 4, 5, 6 or 7, but is actually 4, 5, 6 or 7 times $2^k$, where k is the exponent bit position for the window's LSB bit. Since the value of the exponent is interpreted as a power of the base $b_r$, a factor of $2^k$ implies squaring k times. Multiplying by a precalculated value when the window MSB is one is used to insure that all ones in the exponent $e_p$ are taken into account and to reduce the total number of pre-calculated values that are needed.
[0032] Even though the exponent bit-scanning routine has reduced the number of multiplications that have to be performed in the respective calculations of $b_r^{e_p} \bmod p$ and $b_r^{e_q} \bmod q$, there still are a number of multiplications that need to be performed. The modular exponentiator 20 utilizes an efficient multiplication algorithm for modular terms, referred to in the art as Montgomery multiplication. The Montgomery algorithm provides that:

$$ \mathrm{Mont}(a,b) = \frac{a \cdot b}{2^k} \bmod n $$

where k is the number of bits in the modulus n, n is relatively prime to $2^k$, and n>a, n>b. In order to use the algorithm for repeated multiplies, the values of a and b must be put into Montgomery form prior to performing the Montgomery multiply, where:

$$ \text{Montgomery form of } x = x \cdot 2^k \bmod n $$

If the two values to be Montgomery multiplied are in Montgomery form, then the result will also be in Montgomery form.
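
The definition and the closure property stated above can be checked numerically with a reference model. This is a definitional sketch only: the names `mont`, `to_mont` and `from_mont` are hypothetical, and the division by $2^k$ is carried out with an explicit modular inverse, which is exactly the cost that a real Montgomery implementation avoids.

```python
def mont(a: int, b: int, n: int, k: int) -> int:
    """Reference Montgomery product: a * b / 2**k mod n (n must be odd)."""
    r_inv = pow(1 << k, -1, n)        # inverse of 2**k modulo n, Python 3.8+
    return (a * b * r_inv) % n


def to_mont(x: int, n: int, k: int) -> int:
    """Convert x to Montgomery form: x * 2**k mod n."""
    return (x << k) % n


def from_mont(x_m: int, n: int, k: int) -> int:
    """Leave Montgomery form by multiplying with a plain 1."""
    return mont(x_m, 1, n, k)


if __name__ == "__main__":
    n, k = 97, 7                      # 97 is odd and fits in 7 bits
    a, b = 45, 76
    a_m, b_m = to_mont(a, n, k), to_mont(b, n, k)
    p_m = mont(a_m, b_m, n, k)        # product stays in Montgomery form
    print(from_mont(p_m, n, k), (a * b) % n)
```

Because the product of two Montgomery-form values is itself in Montgomery form, a chain of squarings and multiplies needs only one conversion in and one conversion out.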
[0033] FIG. 6 illustrates a flow chart describing a Montgomery multiplication operation executed by the modular exponentiator 20. As with the exponent bit-scanning routine described above with respect to FIG. 4, the Montgomery multiplication operation may be coded in firmware and executed sequentially within the respective processing units 24a, 24b by the control units 26a, 26b which access the multipliers 28a, 28b for particular aspects of the operation, as will be further described below. Alternatively, the Montgomery multiplication routine may be hardwired as discrete logic circuits that are optimized to perform the various functions of the routine.

[0034] As illustrated in FIG. 6, the Montgomery multiplication routine includes a major loop and two minor loops. In each major loop, a distinct word of a multiplicand $b_i$ is multiplied by each of the words of a multiplicand $a_j$, where j is the number of words in multiplicand $a_j$, and i is the number of words in multiplicand $b_i$. The Montgomery multiplication routine is called at step 301. The two multiplicands $a_j$ and $b_i$ are loaded into respective registers at step 302, along with a square flag. If the two multiplicands $a_j$ and $b_i$ are equal, the square flag is set to one so that a squaring speed-up subroutine may be called at step 400. The squaring speed-up subroutine will be described in greater detail below. If the two multiplicands $a_j$ and $b_i$ are not equal, then the square flag is set to zero.

[0035] Before initiating the first major loop, i is set to be equal to one at step 305 so that the first word of multiplicand $b_i$ is accessed. The square flag is checked at step 306 to determine whether the squaring speed-up subroutine should be called, and if not, j is set equal to one at step 307. The two words $a_j$ and $b_i$ are multiplied together within the first minor loop at step 308, and the product added to the previous carry and previous $c_j$. It should be appreciated that in the first pass through the routine, the carry and $c_j$ values are zero. The lower word of the result is stored as $c_j$ and the higher word of the result is used as the next carry. The first minor loop is repeated by incrementing j at step 310 until the last word of $a_j$ is detected at step 309, which ends the first minor loop. Before starting the second minor loop, a special reduction value is calculated that produces all "0"s for the lowest word of $c_j$ when multiplied with $c_j$, and j is set to two at step 311. Thereafter, at step 312, the special reduction value is multiplied by the modulus $n_j$, added to the previous carry and $c_j$. The lower word of the result is stored as $c_{j-1}$ and the higher word of the result is used as the next carry. The second minor loop is repeated by incrementing j at step 314 until the last word of $c_j$ is detected at step 313, which ends the second minor loop. Once the second minor loop ends, i is incremented at step 316 and the major loop is repeated until the last word of $b_i$ has passed through the major loop. Then, the modular reduction of the final result of $c_j$ with respect to n is obtained at step 317, and the Montgomery multiplication routine ends at step 318. An example of a Montgomery multiplication of $a_j$ with $b_i$ in which both multiplicands are four words long is provided at FIG. 9. In the example, the symbol Σ is used to denote the combination of all previous values.
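
The major loop and two minor loops can be sketched with a standard word-serial Montgomery multiplication (the coarsely integrated operand scanning form). This is offered as an illustration of the loop structure, not as a transcription of FIG. 6: the function name `mont_mul_words`, the 0-based indices, the extra accumulator words, and the way the reduction value is derived from the negated inverse of n modulo the word size are all assumptions of this sketch.

```python
def mont_mul_words(a: int, b: int, n: int, word_bits: int, num_words: int) -> int:
    """Word-serial Montgomery product a*b / 2**(word_bits*num_words) mod n.

    Assumes n is odd, n < 2**(word_bits*num_words), and a, b < n.
    """
    mask = (1 << word_bits) - 1
    split = lambda x: [(x >> (word_bits * j)) & mask for j in range(num_words)]
    a_w, b_w, n_w = split(a), split(b), split(n)
    # Constant that makes the lowest accumulator word vanish: -n^-1 mod 2^w
    n_prime = (-pow(n, -1, 1 << word_bits)) & mask

    c = [0] * (num_words + 2)                 # accumulator words c_j
    for i in range(num_words):                # major loop over words of b
        carry = 0
        for j in range(num_words):            # first minor loop: a_j * b_i
            t = a_w[j] * b_w[i] + c[j] + carry
            c[j], carry = t & mask, t >> word_bits
        t = c[num_words] + carry
        c[num_words], c[num_words + 1] = t & mask, t >> word_bits

        m = (c[0] * n_prime) & mask           # special reduction value
        carry = (c[0] + m * n_w[0]) >> word_bits   # low word is now all zeros
        for j in range(1, num_words):         # second minor loop: m * n_j
            t = c[j] + m * n_w[j] + carry
            c[j - 1], carry = t & mask, t >> word_bits   # store as c_(j-1)
        t = c[num_words] + carry
        c[num_words - 1] = t & mask
        c[num_words] = c[num_words + 1] + (t >> word_bits)

    result = sum(c[j] << (word_bits * j) for j in range(num_words + 1))
    return result - n if result >= n else result   # final reduction


if __name__ == "__main__":
    n = 0xF123456789ABCDEF
    a, b = 0x0123456789ABCDEF, 0x0FEDCBA987654321
    expected = a * b * pow(1 << 64, -1, n) % n
    print(mont_mul_words(a, b, n, word_bits=16, num_words=4) == expected)
```

Storing each result of the second minor loop one word lower is what performs the division by the word size, so no explicit shift or division appears anywhere in the loop.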

[0036] The Montgomery multiplication routine of FIG. 6 can be speeded up when used to square a number by recognizing that some of the partial products of the multiplication are equal. In particular, when multiplicand $a_j$ is equal to multiplicand $b_i$, i.e., a squaring operation, then the partial products of various components of the multiplication would ordinarily be repeated, e.g., the partial product of $a_2$ with $b_3$ is equal to the partial product of $a_3$ with $b_2$. As illustrated in FIG. 9, both of these partial products occur during the third major loop iteration. Thus, the first time the partial product is encountered it can be multiplied by two to account for the second occurrence, and a full multiplication of the second partial product can be skipped. Multiplication by two constitutes a single left shift for a binary number, and is significantly faster than a full multiplication operation. It should be appreciated that a great number of squaring operations are performed by the modular exponentiator 20 due to the operation of the exponent bit-scanning routine described above, and an increase in speed of the squaring operations would have a significant effect on the overall processing time for a particular modular exponentiation.

[0037] FIG. 7 illustrates a flow chart describing the squaring speed-up subroutine, which is called at step 401. Initially, j is set to be equal to i at step 402, which, in the first iteration of the major loop of FIG. 6, will be equal to one. In subsequent iterations of the major loop, however, it should be apparent that j will begin with the latest value of i and will thus skip formation of partial products that have already been encountered. At step 403, i is compared to j. If i is equal to j, then at step 405 a factor is set to one, and if i and j are not equal, then at step 404 the factor is set to two. Thereafter, in step 406, $a_j$ and $b_i$ and the factor are multiplied together and the product added to the previous carry and $c_j$. As in step 308 of FIG. 6, the lower word of the result is stored as $c_j$ and the higher word of the result is used as the next carry. After completing the multiplication step 406, j is incremented at step 408 and the loop is repeated until the last word has passed through the loop, at which time the squaring speed-up subroutine ends at step 409. At step 410 of FIG. 6, the Montgomery multiplication routine resumes just after the first minor loop. It should be appreciated that the squaring speed-up subroutine will operate in place of the first minor loop for every iteration of the major loop of the Montgomery multiplication routine when the squaring flag is set.
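
The saving from the factor of one or two can be seen in isolation with a plain multi-word squaring, leaving the Montgomery reduction aside. The sketch below is an assumption-laden illustration: the function name `square_by_symmetry` is hypothetical, indices are 0-based, and the partial products are accumulated into one Python integer rather than into words $c_j$ with carries.

```python
def square_by_symmetry(a: int, word_bits: int, num_words: int):
    """Square a multi-word number, forming each distinct partial product once.

    Returns (square, number_of_full_word_multiplications).
    """
    mask = (1 << word_bits) - 1
    words = [(a >> (word_bits * j)) & mask for j in range(num_words)]

    total = 0
    full_mults = 0
    for i in range(num_words):
        for j in range(i, num_words):       # j starts at i: skip repeats
            factor = 1 if i == j else 2     # off-diagonal products occur twice
            partial = words[i] * words[j]   # the only full multiplication
            full_mults += 1
            # Multiplying by two is a single left shift
            total += (partial << (factor - 1)) << (word_bits * (i + j))
    return total, full_mults


if __name__ == "__main__":
    a = 0x0123456789ABCDEF
    sq, mults = square_by_symmetry(a, word_bits=16, num_words=4)
    print(sq == a * a, mults)
```

For a four-word operand this forms 10 partial products instead of the 16 needed by a general multiplication, and the gap widens as the number of words grows.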

[0038] In order to perform the Montgomery multiplication routine more efficiently, the multipliers 28a, 28b are tailored
to perform specific operations. In particular, the multipliers 28a, 28b include specific functions for multiplying by two
(used by the squaring speed-up routine), executing an a*b+c function, and performing the mod 2^n function on a 2n-bit
result while leaving the higher n bits in a carry register.
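
A minimal Python sketch of the a*b+c function combined with the mod 2^n split is given below. The name `mul_add` and the extra `carry` argument are assumptions for illustration. When all four inputs are n-bit values, the sum a*b + c + carry never exceeds 2^(2n) - 1, so the result always fits in one low word plus one carry word.

```python
def mul_add(a, b, c, carry=0, n=64):
    """Return (low, high) for a*b + c + carry with n-bit operands.

    low  is the result mod 2^n (the word that would be stored),
    high is the upper n bits (the value left in the carry register).
    """
    t = a * b + c + carry
    return t & ((1 << n) - 1), t >> n
```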

[0039] FIG. 10 is a chart showing a block diagram of a system architecture which can be employed to practice the
present invention. In this embodiment, the architecture is implemented on an ASIC 500. ASIC 500 comprises a CPU 12
with a processor 502, which performs operations required to implement the present invention. In one embodiment,
processor 502 comprises a reduced instruction set (RISC) POWERPC™ 401 core processor available from the IBM™
Corporation. Processor 502 provides a trace interface 504 and a watch interface 506, and obtains instructions via an
external FLASH/SRAM memory interface module 520 and a 32 bit external memory interface 522. The trace interface
504 and the watch interface 506 provide for error detection and debugging. To enhance performance, processor 502
interfaces with the ASIC module bus 524 via a selectable data cache 508 and an instruction cache 510. The ASIC 500
interface logic 22 comprises a general I/O module 516 with a 4 bit external interface 518, an external memory interface
module 520 and associated interface 522, and a PCI interface module 512 and associated PCI interface 514. The PCI
interface 514 provides a 32 bit data channel nominally operating at 33 MHz. The PCI interface module 512 provides the
operations necessary for compliance with the PCI interface I/O and command protocol, including built-in input and
output first-in first-out (FIFO) buffers for efficient data transfer. Data transfer among other modules in the ASIC 500
is provided by the ASIC module bus 524. The ASIC 500 also optionally comprises a high speed dedicated random
number generator 526, for key generation and padding. In accordance with the principles described herein, the ASIC
500 also comprises a modular exponentiator 20, which includes a pair of parallel processing units 24a, 24b, each
associated with a RAM 25.

[0040] FIG. 11 presents a more detailed view of the processing units 24a,b, the associated RAM 25, and control units
26a,b. The processing unit 24a,b comprises a multiplier 602, a preload register 604, a memory 25, and a multiplexer
606. A control unit 26a,b, operatively coupled to the multiplier 602, preload register 604, memory 25 and multiplexer 606,
controls the operation of these respective devices, in accordance with a clock signal provided by clock 608.

[0041] It is desirable to perform 1024 bit RSA calculations such as modular exponentiations as quickly as possible,
preferably in less than 5 ms at a 33 MHz clock speed. Although the 1024 bit RSA calculations can be reduced to 512
bit calculations using the above teaching, this still leaves the problem of performing two 512 bit calculations within the
5 ms interval.
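
The clock budget implied by these figures can be worked out directly. The short Python sketch below uses only the 33 MHz and 5 ms values from the text; the function name and the use of integer units are choices made for the example.

```python
CLOCK_HZ = 33_000_000   # 33 MHz clock, from the text
BUDGET_MS = 5           # 5 ms target, from the text


def cycle_budget(clock_hz=CLOCK_HZ, budget_ms=BUDGET_MS):
    """Number of clock cycles available within the time budget."""
    return clock_hz * budget_ms // 1000


print(cycle_budget())  # prints 165000
```

All bus transfers and multiplier operations for the calculation therefore have to fit within 165,000 clock cycles.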

[0042] Ordinarily, multiplier 602 would comprise a 64 bit bus for each input number to be multiplied. However, with
such a design, the number of clock pulses necessary to input both values from a 64 bit bus would be too large to
support a 5 ms calculation speed with a 33 MHz clock. The present invention provides this high speed capability with a
unique architecture that includes a 512 bit multiplier input port coupled to a preload register, and a control unit that
enforces an efficient computation protocol for performing 512 by 512 bit multiplications. Further, because of the
predictable nature of the computations required in performing Montgomery multiplications, the control unit 26a,b
enforces a computation protocol that minimizes the clock cycles to input a new number into the preload register.
[0043] In accordance with the foregoing, the multiplier comprises a first input port 610 with N bit capacity, where N is
an integer greater than one, and a second input port 612 with a K*N (hereinafter KN) bit capacity, where K is an integer
greater than one. The illustrated embodiment depicts a system wherein N=64 and K=8, representing the situation
where the first input port is a 64 bit parallel input port, and the second input port is a 512 bit parallel port. Selecting the
capacity of the multiplier first input port 610 to be less than that of the multiplier second port 612 minimizes system
resource requirements without substantially impacting the throughput of the multiplier 602. That is because the
multiplier 602 only operates on 64 bits of the number at the first input port 610 (the "b" multiplicand) every four clocks as the
multiplication is taking place.

[0044] To control the value of multiplicand "a" at port 612 in each successive multiplication, inputs to the preload
register 604 (representing the multiplicand "a") can be provided by the multiplier 602 (from a multiplier output port 614) or
the memory 25 (from a memory output port 616) under selectable control of the multiplexer 606 and the control unit
26a,b. For example, the Montgomery algorithm dictates that the desired value for "a" in the next calculation is often the
same as the value for "a" in the preceding multiplication (see for example, FIG. 9). In such cases, the preload register
604 does not require a new value for "a", and the control unit 26a,b will retain the previous value for "a" in the preload
register, and provide it to the multiplier 602 when necessary. A data path is also provided from the multiplier output port
614 to the preload register 604 to allow immediately needed results to bypass the memory 25, thereby reducing
memory bus traffic.

[0045] Presuming that there is a first number ("b") stored in the memory 25, and a second number ("a") loaded into
the preload register 604, the multiplication of a*b takes place as follows. In the first clock cycle, the full 512 bit value for
the second number ("a") is input from the preload register 604 to the multiplier 602. Next, the first 64 bits of the first
number ("b") are loaded into the multiplier 602. Then, over the next 3 clock cycles, the 64 bit portion of the first number
("b") is multiplied by the 512 bit second number ("a"). The next 64 bits of the first number ("b") are then loaded into the
multiplier 602, and that portion of "b" is multiplied by the 512 bit second number ("a"). This process is repeated until all
bits of the "b" multiplicand are multiplied by all bits of the "a" multiplicand. Loading and multiplying all of the bits of "b"
by those of "a" takes 8*4 = 32 clock cycles. After 4 clock cycles of multiplier 602 internal processing, the output
representing the least significant 512 bits of the product of the first number ("b") and the second number ("a") is output over
the next 8 clock cycles. The most significant 512 bits of the product remain in the multiplier 602, and are used for further
carry operations. Accordingly, 1 + 32 + 4 + 8 = 45 clock cycles are required to determine the product of "a" and "b."
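
The data flow and clock count described above can be modeled in a few lines of Python. This is a behavioral sketch only: the function names are hypothetical, and the model says nothing about how the hardware forms each 64 by 512 bit partial product internally.

```python
def multiply_chunked(a, b, n_bits=64, k=8):
    """Multiply a KN-bit 'a' by a KN-bit 'b', consuming 'b' N bits at a time.

    Returns (low, high): the least significant KN bits (sent to the output)
    and the most significant KN bits (kept in the multiplier).
    """
    kn = n_bits * k
    chunk_mask = (1 << n_bits) - 1
    acc = 0
    for i in range(k):
        b_chunk = (b >> (n_bits * i)) & chunk_mask   # N bits on the first input port
        acc += (a * b_chunk) << (n_bits * i)         # N x KN partial product
    return acc & ((1 << kn) - 1), acc >> kn


def cycles_per_multiply(k=8, clocks_per_chunk=4, internal=4):
    """1 clock to load 'a', 4 per chunk of 'b', 4 internal, K to output."""
    return 1 + k * clocks_per_chunk + internal + k
```

With the default N=64 and K=8, `cycles_per_multiply()` reproduces the 45 clock figure from the text.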

[0046] Although the data channel 622 from the preload register 604 to the multiplier 602 is 512 bits, the bus capacity
to all other input and output ports, including the memory 25, is only 64 bits. Therefore, in cases where a preload value
is required (a new "a" value), an additional 8 clock cycles would ordinarily be required to load the value from the memory
25 to the preload register 604 from the 64 bit data channel. This would mean that for any multiplication requiring a new
"a" value, the number of required clock cycles to complete the operation would be 45+8 = 53. To avoid this problem the
control unit 26a,b of the present invention invokes a different command protocol when a new "a" value is expected. This
protocol makes use of the 64 bit input bus during the three clocks after each 64 bit "b" value is supplied to the multiplier
602. In particular, the predicted, future value needed for the next multiplication is fetched from the memory 25 and
directed to the preload register 604 during the clock period following the input of the "b" value to the multiplier 602. The
predicted future value for "a" is ascertainable due to the deterministic nature of the Montgomery multiplication routine,
which frequently uses the same "a" value while varying only the "b" value.
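
The saving can be illustrated with a small cycle-count model. The sketch below is an assumption-laden simplification: the names are hypothetical, and it assumes that every multiplication needing a new "a" value is preceded by another multiplication during which the preload can be hidden.

```python
BASE_CYCLES = 45      # one 512 x 512 bit multiplication, from the text
PRELOAD_CYCLES = 8    # 512 bits over the 64 bit bus, from the text


def total_cycles(needs_new_a, overlapped):
    """Total clocks for a sequence of multiplications.

    needs_new_a[i] is True when multiplication i requires a new 'a' value
    from memory; overlapped selects the protocol that hides the preload.
    """
    total = 0
    for new_a in needs_new_a:
        total += BASE_CYCLES
        if new_a and not overlapped:
            total += PRELOAD_CYCLES   # serial load before the multiply starts
    return total
```

For example, `total_cycles([True], overlapped=False)` gives 53, while the overlapped protocol keeps the same multiplication at 45.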

[0047] FIG. 12 is a timing diagram illustrating the foregoing logic. Trace 702 represents the signal from the clock 608.
Trace 704 indicates the clock cycles where values for "b" are supplied to the multiplier 602 from the memory 25. Since
the bus connecting the memory 25 output port 616 and the multiplier first input port 610 is a 64 bit bus, values for the
512 bit number "b" are supplied to the multiplier 602 in 64 bit increments. Accordingly, location 708 on trace 704
indicates where the first 64 bits of the 512 bit number "b" are transferred to the multiplier 602 via the multiplier first input
port 610. At a clock pulse after the clock pulse in which the first 64 bits of the "b" value were transferred to the multiplier
602, 64 bits of the "a" value for the next multiplication are transferred from the memory 25 to the preload register 604.
This is indicated at the pulse 710 on trace 706. The foregoing can also be implemented with pulse 710 occurring two or
more cycles after the cycle loading the "b" information as well. This process is repeated until all bits representing "b"
have been loaded into the multiplier 602 and all bits representing the new "a" value have been pre-loaded into the
preload register 604.
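
The interleaving shown in the timing diagram can be sketched as a simple schedule generator. The clock numbering (clock 0 for the 512 bit load of "a", then four clocks per 64 bit portion of "b") and all names are assumptions made for illustration; the `preload_offset` parameter models the option of placing the preload pulse one or more cycles after the "b" load.

```python
def preload_schedule(k=8, clocks_per_chunk=4, preload_offset=1):
    """Return {clock: [events]} for one multiplication with an overlapped preload."""
    if not 1 <= preload_offset < clocks_per_chunk:
        raise ValueError("preload must use a clock in which 'b' is not on the bus")
    schedule = {0: ["a -> multiplier (KN bits)"]}
    for i in range(k):
        b_clock = 1 + i * clocks_per_chunk
        # N bits of 'b' travel from memory to the multiplier first input port.
        schedule.setdefault(b_clock, []).append(f"b[{i}] -> multiplier")
        # N bits of the predicted 'a' travel from memory to the preload register.
        schedule.setdefault(b_clock + preload_offset, []).append(
            f"next a[{i}] -> preload register")
    return schedule
```

Because the "b" loads and the preload transfers fall on different clocks, the 64 bit memory bus carries at most one transfer per clock in this model.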

[0048] FIGs. 13 and 14 are flow charts depicting the multiplication operations of one embodiment of the present
invention. First, as shown in block 802, KN bits of "a" are provided from the preload register 604 to the multiplier 602 at the
multiplier second input port 612. This is accomplished in a single clock pulse. Next, N bits of "b" are provided to the first
input port 610 of the multiplier 602 in a single clock pulse. This is shown in block 804.

[0049] In the Montgomery algorithm, the operand "a" is often used in successive calculations, and can also be
predicted from past values. Because of this deterministic nature, the value for "a" in successive calculations may be
predicted. If a new value is predicted for "a" in following computations, N bits of the predicted "a" value are provided from the
memory 25 to the preload register 604 in a single clock. This can be performed in a clock pulse following the pulse
providing the N bits of "b" to the multiplier, and is depicted in blocks 806 and 814. By providing the predicted "a" value from
the memory 25 at this time, a potential bottleneck in data flow for new "a" values is minimized, as described above with
reference to FIG. 12. If a new value for "a" is not anticipated, the logic from block 806 proceeds to block 808, where the
KN bits of "a" are multiplied by the N bits of "b."

[0050] This process is repeated until all KN bits of "b" have been multiplied by all KN bits of "a," as illustrated in block
810, resulting in the output number provided in block 812. Then, as shown in block 814, N bits of the output number are
provided to the multiplier output port in a single clock pulse. If the current output value from the multiplier 602 is required
for the next multiplication, N bits of the output number are provided to the preload register 604. This is illustrated in
blocks 816 and 818. If not, logic proceeds to block 820, where the N bits of the output number are provided to the
memory 25. As depicted in block 822, the operations performed in blocks 814 through 822 are repeated until all KN bits of
the output number are provided to the memory 25.

[0051] Using the foregoing techniques, the multiplier 602 is capable of efficiently performing a number of operations
on "a" and "b" in addition to multiplication. These operations are described in Table 1 below:



Table 1

Address   Instruction              Control Word Description

0000      a + b                    Read value "a" from the memory 25, and add it to an accumulator in the
                                   multiplier 602. This can be accomplished by performing the operation
                                   [(a * b) + acc] where either "a" or "b" is set to one.

0001      a*b + acc                Read "a" and "b" from the memory 25 and execute a multiply and accumulate
                                   function. The LSB of the result is stored back in the memory 25. All data
                                   transfer between the multiplier 602 and the memory 25 occurs with the LSBs
                                   first, with the memory address decremented by one after each memory read or
                                   memory write operation.

0010      a*b + acc                Use previous value of "a," read "b" from memory 25 and execute multiply and
                                   accumulate function. The LSBs of the result are stored back in the memory 25.

0011      Save acc                 Store accumulator value in the memory 25, and clear accumulator.

0100      Save acc and overflow    Store accumulator and overflow in the memory 25. Clear accumulator and
                                   overflow.

0110      ((a*b)*2) + acc          Use previous value of "a," read "b" from the memory 25, and execute multiply
                                   and accumulate * 2 function. The LSBs of the result are stored back in the
                                   memory 25.

0111      Clear acc and overflow   Clear the accumulator and overflow.
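
The arithmetic side of the control words in Table 1 can be sketched in Python as shown below. The enumeration member names and the `evaluate` function are hypothetical, introduced only for illustration. The sketch covers just the value produced by the multiply-type instructions; memory addressing, the overflow register, and the save and clear instructions are not modeled.

```python
from enum import Enum


class ControlWord(Enum):
    """Addresses from Table 1; the member names are illustrative only."""
    ADD = 0b0000
    MAC_LOAD_A = 0b0001
    MAC_KEEP_A = 0b0010
    SAVE_ACC = 0b0011
    SAVE_ACC_OVERFLOW = 0b0100
    MAC_TIMES_TWO = 0b0110
    CLEAR = 0b0111


def evaluate(op, a, b, acc):
    """Arithmetic value produced by the multiply-type control words."""
    if op is ControlWord.ADD:
        return a * 1 + acc               # [(a * b) + acc] with "b" set to one
    if op in (ControlWord.MAC_LOAD_A, ControlWord.MAC_KEEP_A):
        return a * b + acc               # multiply and accumulate
    if op is ControlWord.MAC_TIMES_TWO:
        return ((a * b) << 1) + acc      # doubling is a single left shift
    raise ValueError(f"{op.name} does not produce an arithmetic result")
```

The two multiply and accumulate words differ only in where "a" comes from (memory or the retained preload value), so they share one branch here.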



CONCLUSION



[0052] A computationally efficient multiplication apparatus and method especially well suited to modular
exponentiation has been described. The apparatus uses a preload register, coupled to a multiplier via a KN bit bus, to load the value
of the "a" multiplicand in the multiplier in a single clock pulse. The "b" multiplicand (which is also KN bits long) is
supplied to the multiplier N bits at a time from a memory via an N bit bus. The multiplier multiplies the N bits of the "b"
multiplicand by the KN bits of the "a" multiplicand until all KN bits of "b" are multiplied by the KN bits of "a."
[0053] The method provides KN bits of the multiplicand "a" from a preload register to the multiplier in a single clock
pulse. Then, N bits of the multiplicand "b" are provided to the multiplier, also in a single clock pulse. Next, the KN bits of
the number "a" are repeatedly multiplied by the N bits of the number "b" until all of the KN bits of the "b" multiplicand are
provided to the first multiplier input port and multiplied by the KN bits of the "a" multiplicand. When completed, these
operations result in an output number, which is then transmitted N bits at a time to the memory, where it can be made
available for further processing.

[0054] In accordance with the deterministic behavior of the Montgomery algorithm, one embodiment of the present
invention loads a predicted (future) value for multiplicand "a" into the preload register while multiplication operations on
the current "a" and "b" multiplicands are being performed. This technique further reduces the clock cycles necessary to
load and multiply the parameters.

[0055] It should also be appreciated that various modifications, adaptations, and alternative embodiments of the
computationally efficient multiplier may be made within the scope and spirit of the present invention. For example, while the
present invention is well suited to cryptographic systems implemented with special purpose processors, it is also useful
in non-cryptographic systems and may be implemented in general purpose processors as well. In such cases, one or
more computer-executable programs of instructions implementing the invention may be tangibly embodied in a
computer-readable program storage device such as a floppy disk or other storage media.

Claims

1. A method for performing multiplication of a first number representable by KN bits and a second number
representable by KN bits, where K and N are positive integers and KN is the product of K and N, comprising the steps of:

(a) providing KN bits of the second number from a preload register to a multiplier second input port in a single
clock pulse;

(b) providing N bits of the first number to a multiplier first input port from a memory in a single clock pulse;

(c) multiplying the KN bits of the second number times the N bits of the first number; and

(d) repeating steps (b) and (c) until all KN bits of the first number have been multiplied by all KN bits of the
second number to generate an output number.

2. The method of claim 1, further comprising the step of:

(a) providing N bits of a predicted second number having KN bits from the memory to the preload register in a
single clock pulse after performing the step of providing the N bits of the first number to the multiplier first input
port; and

(b) repeating step (a) for each of the KN bits of the predicted second number.

3. The method of claim 1, further comprising the steps of:

(e) providing N bits of the output number to a multiplier output port in a single clock pulse; and

(f) repeating step (e) until all KN bits of the output number have been provided to the output port.

4. The method of claim 1, further comprising the steps of:

(a) providing N bits of the output number to the memory in a single clock pulse;

(b) repeating step (a) until all KN bits of the output number are provided to the memory.

5. The method of claim 1, further comprising the steps of:

(a) providing N bits of the output number to the preload register and the memory in a single clock pulse;

(b) repeating step (a) until all KN bits of the output number are provided to the preload register and the memory.

6. The method of claim 4, wherein the second number is provided to the preload register selectively from the memory
output port and the multiplier output port via a multiplexer, the multiplexer coupled to the memory output port, the
multiplier output port, and the preload register.

7. The method of claim 1, further comprising the steps of:

(a) providing N bits of the second number from the memory to the preload register in a single clock pulse;

(b) repeating step (a) until all KN bits of the second number are provided to the preload register.

8. A computational apparatus, comprising:

a multiplier, for multiplying a first number representable by N bits and a second number representable by KN
bits to generate an output, wherein K and N are positive integers, the multiplier comprising a first input port for
accepting a first number, a second input port for accepting a second input number, and an output port;

a memory for storing the output, the memory comprising a memory input port communicatively coupled to the
multiplier output port via a first N bit data channel and a memory output port communicatively coupled to the
multiplier first input port via a second N bit data channel; and

a preload register for accepting and storing the second number, the preload register communicatively coupled
to the multiplier second input port via a KN bit data channel.

9. The apparatus of claim 8, wherein the preload register is communicatively coupled to the multiplier output port.

10. The apparatus of claim 8, wherein the preload register is communicatively coupled to the memory output port.

11. The apparatus of claim 8, wherein the preload register is communicatively coupled to the multiplier output port and
the memory output port via a multiplexer, the multiplexer for selectably controlling communicative coupling between
the preload register, the multiplier output port, and the memory output port.

12. The apparatus of claim 8, further comprising a controller operatively coupled to the multiplier, the multiplexer, the
preload register, and the memory, the controller implementing the steps of:

(a) providing KN bits of the second number from the preload register to the multiplier in a first clock cycle;

(b) providing N bits of the first number from the memory to the multiplier in a second clock cycle;

(c) multiplying the KN bits of the second number times the N bits of the first number while providing N bits of a
predicted second number from the memory to the preload register a clock cycle following the second clock
cycle; and

(d) repeating steps (b) and (c) until all KN bits of the first number have been multiplied by the KN bits of the
second number and all KN bits of the predicted second number have been loaded into the preload register.

13. An apparatus for multiplying a first number representable by KN bits and a second number representable by KN
bits, comprising:

means for providing KN bits of the second number from a preload register to a multiplier second input port in
a single clock pulse; and

means for repeatedly providing N bits of the first number to a multiplier first input port from a memory via an N
bit data channel in a single clock pulse and for repeatedly multiplying the KN bits of the second number times
the N bits of the first number until all KN bits of the first number have been multiplied by all KN bits of the
second number to generate an output number.

14. The apparatus of claim 13, further comprising means for repeatedly providing N bits of the output number to a
multiplier output port in a single clock pulse until all KN bits of the output number have been provided to the output port.

15. The method of claim 14, wherein the second number is provided to the preload register selectively from the mem- 
ory output port and the multiplier output port via a multiplexer coupled to the memory output port and the multiplier 
output port and the preload register. 

16. The apparatus of claim 13, further comprising means for repeatedly providing N bits of the output number to the
preload register and the memory in a single clock pulse until all KN bits of the output number are provided to the
preload register and the memory.

17. The apparatus of claim 13, further comprising means for repeatedly providing N bits of the second number from the
memory to the preload register in a single clock pulse until all KN bits of the second number are provided to the
preload register.

18. The apparatus of claim 13, further comprising means for repeatedly providing N bits of a predicted second number
having KN bits from the memory to the preload register in a single clock pulse after performing the step of providing
the N bits of B to the multiplier first input port.

19. A computational apparatus, comprising:

a multiplier, having a multiplier first input port for accepting a first number, a multiplier second input port for
accepting a second input number, and a multiplier output port providing an output representing a product of the
first number and the second number computed over K clock cycles;

a memory, for storing a first number and a second number, the memory having a memory input port operatively
coupled to the multiplier output port and a memory output port operatively coupled to the multiplier first input
port; and

a preload register, for accepting and storing the second number over K clock cycles, and transmitting the
second number to the first multiplier input port in a single cycle.

20. The apparatus of claim 19, wherein the preload register is communicatively coupled to the multiplier output port and
the memory output port via a multiplexer, the multiplexer for selectably controlling communicative coupling between
the preload register, the multiplier output port, and the memory output port.